Settings

DiMSum version: 1.2.11

Project name: Bri2_12NNK_03

Run Started: 2023-06-23 17:33:52

Run Completed: 2023-06-23 17:51:13

Command-line arguments:

## runDemo                 FALSE
## fastqFileDir            /users/blehner/bbolognesi/mmartin/Bri2_ADan_ABri/Bri2_NNK/
## fastqFileExtension      .fastq
## gzipped                 TRUE
## stranded                TRUE
## paired                  TRUE
## barcodeErrorRate        0.25
## experimentDesignPath    /users/blehner/bbolognesi/mmartin/Bri2_ADan_ABri/Bri2_NNK/experiment_design_03.txt
## experimentDesignPairDuplicates
##                         FALSE
## cutadapt5First          CAAATTTGCCGTGGAAACTTTAATTTGT
## cutadapt5Second         GGTGGCGGCCGCTCTAGATTA
## cutadaptMinLength       36
## cutadaptErrorRate       0.2
## cutadaptOverlap         3
## vsearchMinQual          30
## vsearchMaxee            0.5
## vsearchMinovlen         10
## outputPath              /users/blehner/bbolognesi/mmartin/Bri2_ADan_ABri/Bri2_NNK/
## projectName             Bri2_12NNK_03
## wildtypeSequence        TGGAAGGGGACGATGGTTGTGGGTAGTAATTGGCCG
## permittedSequences      NNKNNKNNKNNKNNKNNKNNKNNKNNKNNKNNKNNK
## reverseComplement       FALSE
## sequenceType            coding
## mutagenesisType         codon
## transLibrary            FALSE
## transLibraryReverseComplement
##                         FALSE
## bayesianDoubleFitness   FALSE
## bayesianDoubleFitnessLamD
##                         0.025
## fitnessMinInputCountAll
##                         0
## fitnessMinInputCountAny
##                         100
## fitnessMinOutputCountAll
##                         0
## fitnessMinOutputCountAny
##                         0
## fitnessHighConfidenceCount
##                         10
## fitnessDoubleHighConfidenceCount
##                         50
## fitnessNormalise        TRUE
## fitnessErrorModel       TRUE
## indels                  none
## maxSubstitutions        12
## mixedSubstitutions      TRUE
## retainIntermediateFiles
##                         TRUE
## splitChunkSize          3758096384
## retainedReplicates      all
## startStage              1
## stopStage               5
## numCores                10

Pipeline stages

The DiMSum pipeline consists of five stages grouped into two modules which can be run independently:

  • WRAP (Stages 1-3) processes raw FastQ files generating a table of variant counts
  • STEAM (Stages 4-5) analyses variant counts generating variant fitness and error estimates

Below you will find summary plots with results of each stage corresponding to the module(s) that were run.

1. QC raw reads (WRAP)

DiMSum Stage 1 (QC) summarises base qualities from each raw FastQ file using FastQC.

The plot below shows 10th percentile (upper) and mean (lower) Phred quality scores at the indicated positions in the forward reads (Read 1) in all FastQ files (see legend).

If mean read qualities are low (Phred score < 30) in the constant region sequence, it might be necessary to increase the maximum allowable number of mismatches during trimming in Stage 2 ('cutadaptErrorRate'). If qualities are low in the variable region sequence, it may be necessary to adjust Stage 3 (ALIGN) options ('vsearchMinQual', 'vsearchMaxee'). Be aware that changing these options from their defaults can substantially increase the number of 'fake' (spurious) variants due to sequencing errors. See DiMSum documentation for details.
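As a reminder of what the Phred threshold means, the standard conversion from a Phred score Q to a base-call error probability is 10^(-Q/10). The short sketch below illustrates this conversion; the function name is illustrative and is not part of DiMSum or FastQC.

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


# Print the error probability for a few common Phred scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4g}")
```

A Phred score of 30 therefore corresponds to roughly one expected error per 1000 base calls.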

The plot below is similar to the one above except quality scores for reverse reads (Read 2) are shown.

2. TRIM constant regions (WRAP)

DiMSum Stage 2 (TRIM) removes constant region sequences at the start (5') and/or end (3') of each read with Cutadapt if required.

The plot below shows the percentage of forward reads (Read 1) in which the specified constant regions were matched and trimmed (see legend), shown separately for each FastQ file.

Untrimmed reads (or read pairs) are discarded if constant region sequences are specified but not found. Trimmed reads are also discarded if the trimmed sequence length is too short ('cutadaptMinLength'). If the percentage of trimmed reads is low, check that constant region sequences were correctly specified ('cutadapt5First', 'cutadapt5Second', 'cutadapt3First', 'cutadapt3Second'). It may also be necessary to increase the maximum allowable number of mismatches ('cutadaptErrorRate') if sequence qualities are low or decrease the minimum allowable overlap between read and constant region ('cutadaptOverlap') if constant region sequences are very short (<3bp). See DiMSum documentation for details.
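To get a feel for what 'cutadaptErrorRate' permits, the sketch below applies Cutadapt's documented rule for a full-length match: the number of allowed errors is the constant region length multiplied by the error rate, rounded down. The helper name is hypothetical. The sequences and the rate of 0.2 are taken from the settings of this run. Partial matches are judged on the matched length instead, which this sketch does not cover.

```python
import math


def max_allowed_errors(constant_region: str, error_rate: float) -> int:
    """Errors tolerated for a full-length match: floor(length * error rate)."""
    return math.floor(len(constant_region) * error_rate)


# Values from the command-line arguments of this run.
cutadapt5_first = "CAAATTTGCCGTGGAAACTTTAATTTGT"
cutadapt5_second = "GGTGGCGGCCGCTCTAGATTA"
error_rate = 0.2

for name, seq in (("cutadapt5First", cutadapt5_first),
                  ("cutadapt5Second", cutadapt5_second)):
    print(name, len(seq), "nt, allowed errors:", max_allowed_errors(seq, error_rate))
```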

The plot below is similar to the one above except trimming statistics for reverse reads (Read 2) are shown.

3. ALIGN PE reads (WRAP)

DiMSum Stage 3 (ALIGN) aligns paired-end reads using VSEARCH. This stage also filters the resulting variant sequences based on minimum base quality, total number of expected base calling errors and sequence length. If reads are the result of single-end sequencing, these same filters are applied.

The plot below shows the total percentage of reads (or read pairs) retained for downstream analysis ('vsearch_aligned'), shown separately for each FastQ file. Remaining reads are discarded. Details of each category are as follows:

  • 'vsearch_aligned' (retained)
  • 'vsearch_no_alignment_found' (discarded: no alignment found)
  • 'vsearch_too_many_diffs' (discarded: >10 mismatches in the alignment)
  • 'vsearch_overlap_too_short' (discarded: alignment is too short, see 'vsearchMinovlen' option)
  • 'vsearch_exp_errs_too_high' (discarded: total number of expected base calling errors too high, see 'vsearchMaxee' option)
  • 'vsearch_min_Q_too_low' (discarded: minimum base quality too low, see 'vsearchMinQual' option)
  • 'cutadapt_not_written' (discarded: read discarded in Stage 2)

If the percentage of reads retained is low (well below 50%), the above options may need to be adjusted. See DiMSum documentation for details.
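The two quality filters can be illustrated with a minimal sketch. It assumes that a sequence is kept only if its lowest base quality is at least 'vsearchMinQual' and its total expected errors (the sum of per-base error probabilities) do not exceed 'vsearchMaxee'. The function names are hypothetical, and the exact comparisons used by VSEARCH and DiMSum may differ in detail.

```python
def expected_errors(quals):
    """Total expected base-calling errors: sum of 10^(-Q/10) over all bases."""
    return sum(10 ** (-q / 10) for q in quals)


def passes_quality_filters(quals, min_qual=30, max_ee=0.5):
    """Assumed filter logic: minimum base quality and maximum expected errors."""
    return min(quals) >= min_qual and expected_errors(quals) <= max_ee


# Hypothetical 36-nt sequences (the wild-type length in this run).
all_q30 = [30] * 36            # every base exactly at the threshold
one_low = [40] * 35 + [20]     # a single low-quality base

print(passes_quality_filters(all_q30))   # minimum quality and expected errors both OK
print(passes_quality_filters(one_low))   # fails on minimum base quality
```

Note that for a 36-nt sequence in which every base is at least Q30, the expected errors cannot exceed 36 × 0.001 = 0.036, so under these assumptions the minimum quality filter is the stricter of the two at the settings used here.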

The plot below shows variant sequence length distributions after alignment, shown separately for all samples. The upper quartile, lower quartile and median are shown in each case (see legend). Check that the median sequence length is as expected (e.g. wild-type sequence length without indels).

4. PROCESS variants (STEAM)

DiMSum Stage 4 (PROCESS) processes sequences and filters them in order to retain user-specified nucleotide or amino acid substitution variants of interest. The result is a table of variant counts for all samples. Read count diagnostic plots can then be used to rapidly check for the presence of problematic variants (likely the result of sequencing errors) and take steps to remove them (see Sections 4.1 and 4.2 below).

The plot below shows the percentage of reads retained or discarded in each sample according to the following criteria:

  • '0 hamming dist.' (retained: wild-type sequence)
  • '1 hamming dist.' (retained: 1 nucleotide substitution from wild-type sequence)
  • '2 hamming dist.' (retained: 2 nucleotide substitutions from wild-type sequence)
  • '3+ hamming dist.' (retained: 3 or more nucleotide substitutions from wild-type sequence)
  • 'indel' (retained: insertion or deletion variant)
  • 'mixed' (discarded: nonsynonymous variants have synonymous substitutions in other codons, see 'mixedSubstitutions' option)
  • 'too many' (discarded: too many nucleotide or amino acid substitutions, see 'maxSubstitutions' option)
  • 'not permitted' (discarded: nucleotide substitution not permitted, see 'wildtypeSequence' option)
  • 'internal constant region' (discarded: nucleotide substitution within internal constant sequence, see 'wildtypeSequence' option)
  • 'indel discarded' (discarded: insertion or deletion variant)
  • 'invalid barcode' (discarded: reads represent barcode sequences, but are not found in the user-supplied barcode identity file, see 'barcodeIdentityPath' option)

Note: The plots below show read counts before application of user-specified count thresholds.

See DiMSum documentation for more details.
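The Hamming distance and 'not permitted' categories above can be sketched as follows. This is an illustrative approximation, not DiMSum's implementation: it assumes a substitution is permitted when the new base is compatible with the IUPAC code at that position in 'permittedSequences' (N = any base, K = G or T). The helper names are hypothetical. The wild-type and permitted sequences are taken from the settings of this run.

```python
# Only the IUPAC codes needed for an NNK library are listed here.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT", "K": "GT"}

WILDTYPE = "TGGAAGGGGACGATGGTTGTGGGTAGTAATTGGCCG"
PERMITTED = "NNK" * 12


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_permitted(variant: str, wildtype: str, permitted: str) -> bool:
    """Assumed rule: each substituted base must match the IUPAC code at its position."""
    return all(v == w or v in IUPAC[p]
               for v, w, p in zip(variant, wildtype, permitted))


# Two hypothetical single-substitution variants at the third base of codon 1.
g_to_t = WILDTYPE[:2] + "T" + WILDTYPE[3:]   # T is compatible with K
g_to_a = WILDTYPE[:2] + "A" + WILDTYPE[3:]   # A is not compatible with K

for v in (g_to_t, g_to_a):
    print(hamming(v, WILDTYPE), is_permitted(v, WILDTYPE, PERMITTED))
```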

Nucleotide variant statistics (counts). The plot below is similar to the one above except that the total number of reads (rather than the percentage) in each sample is shown.

Amino acid variant statistics (percentages). The plots below are similar to the ones above except that amino acid (rather than nucleotide) Hamming distances are shown.

Amino acid variant statistics (counts).

4.1 Input count distributions

The diagnostic plot below shows marginal variant count distributions separately for all Input samples, first stratified by the number of amino acid substitutions and then stratified by the number of nucleotide substitutions (Hamming distance to the wild-type sequence). Distributions corresponding to Hamming distances greater than 6 are not shown. Wild-type sequence counts are indicated by the black vertical dashed line.

Expected counts from 'fake' variants (due to base-call errors at a rate corresponding to the 'vsearchMinQual' option) are indicated by coloured dashed lines. Bimodal distributions (or unimodal distributions not surpassing the indicated thresholds) indicate variants originating from sequencing errors, likely due to a library 'bottleneck'. A minimum input count threshold should be chosen to remove such variants (see 'fitnessMinInputCountAll' option applied in Stage 5 and DiMSum documentation for more details).
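A rough back-of-envelope estimate of how many reads a given 'fake' variant could receive is sketched below. This is not necessarily the calculation DiMSum uses for the dashed lines. It assumes independent base-call errors at the rate implied by 'vsearchMinQual', that each erroneous call is equally likely to be any of the three other bases, and that errors arise from wild-type reads only. The wild-type count used here is a made-up number for illustration.

```python
def expected_error_variant_count(wt_count, min_qual, seq_len, n_subs=1):
    """Expected reads of one specific variant with n_subs substitutions
    created purely by base-call errors in wild-type reads (simplified model)."""
    p = 10 ** (-min_qual / 10)      # per-base error probability
    per_variant = (p / 3) ** n_subs * (1 - p) ** (seq_len - n_subs)
    return wt_count * per_variant


# Hypothetical wild-type read count; sequence length and Q from this run's settings.
wt_count = 1_000_000
for n in (1, 2):
    est = expected_error_variant_count(wt_count, min_qual=30, seq_len=36, n_subs=n)
    print(f"{n} substitution(s): about {est:.3g} reads per specific variant")
```

Under this simplified model, variants whose input counts sit at or below such an estimate are hard to distinguish from sequencing errors.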

Note: The plot below shows variant counts before application of user-specified count thresholds.

4.2 Sample count correlations

The diagnostic plot below is a scatterplot matrix depicting correlations between variant counts from all Input and Output samples. Matrix cells in the upper triangle show Pearson correlation coefficients. Matrix cells in the lower triangle show scatterplot equivalents (hexagonal heatmaps of 2d bin counts). Matrix diagonal cells indicate count densities.
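For reference, the Pearson correlation coefficient between two samples' variant counts can be computed as in the sketch below. The counts are invented for illustration, and the report may apply a transformation (for example a log scale) before computing correlations, which this sketch does not attempt to reproduce.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts for the same five variants in two Input replicates.
input_rep1 = [120, 340, 15, 980, 60]
input_rep2 = [110, 360, 22, 1010, 48]
print(round(pearson(input_rep1, input_rep2), 3))
```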

Distinct variant populations or 'flaps' i.e. subsets of variants that appear at high counts in one replicate but at low counts in another (and not due to selection) indicate replicate or DNA extraction 'bottlenecks'. Minimum input and/or output count thresholds should be chosen to remove such variants (see 'fitnessMinInputCountAll', 'fitnessMinInputCountAny', 'fitnessMinOutputCountAll' and 'fitnessMinOutputCountAny' options applied in Stage 5 and DiMSum documentation for more details).
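The difference between the 'All' and 'Any' input thresholds can be illustrated with a small sketch. It rests on an assumed reading of the option names: a variant is retained only if its input count reaches 'fitnessMinInputCountAll' in every replicate and reaches 'fitnessMinInputCountAny' in at least one replicate. The function name and the counts are hypothetical, and the DiMSum documentation should be consulted for the exact behaviour. The default values are those used in this run.

```python
def passes_input_thresholds(input_counts, min_all=0, min_any=100):
    """Assumed semantics: threshold met in all replicates (min_all)
    and in at least one replicate (min_any)."""
    meets_all = all(c >= min_all for c in input_counts)
    meets_any = any(c >= min_any for c in input_counts)
    return meets_all and meets_any


# Hypothetical input counts for two variants across three replicates.
print(passes_input_thresholds([150, 20, 5]))    # one replicate reaches 100
print(passes_input_thresholds([40, 60, 99]))    # no replicate reaches 100
```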

Note: The plot below shows variant counts before application of user-specified count thresholds.